Add Tesseract training setup scripts and example data #339

Draft: wants to merge 2 commits into base: develop

Penguin2600

Work in progress, opening for visibility.

Current status:

tessTrain/tessTrain.sh - Works, and will set up a baseline Ubuntu 22.04 environment (WSL, container, etc.) with the tools and binaries required for training Tesseract. It also runs an example training session with the included example training data. Documentation and sources are in comments inside the script; look there for further details for now.

tessTrain/example_truth/ - Example of what a training data directory needs to look like. Used by tessTrain.sh to confirm that setup was successful.

Ping me on discord for any questions or comments! Thx.

Owner

@TwoAbove TwoAbove left a comment

Looks good!

Left a couple of minor comments.

I also have a question about dataScripts/tessTrain/example_truth/97984949.png and similar ones. Would the extra M hinder the training in any way?

Comment on lines +25 to +27
sudo apt-get install libicu-dev libpango1.0-dev libcairo2-dev
sudo apt-get install automake ca-certificates g++ git libtool libleptonica-dev make pkg-config
sudo apt-get install libpango1.0-dev libleptonica-dev
Owner

I think it would make sense to extract this into the README as a ## training tesseract section.

I would split this script into two parts - a setup.sh script (also mention it in the README in the setup instructions) and a train.sh script that takes in a ground truth path.
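
As a rough sketch of what the train.sh half of that split could look like: the function below takes a ground truth path and builds a tesstrain `make training` call. `MODEL_NAME`, `START_MODEL` and `GROUND_TRUTH_DIR` are tesstrain Makefile variables; the function name `run_training`, the `DRY_RUN` switch and the default model name `example` are assumptions for illustration, not part of this PR. It also assumes it is run from inside a tesstrain checkout that setup.sh has already prepared.

```shell
#!/bin/sh
# train.sh (hypothetical sketch): train against a ground truth directory.

# run_training GROUND_TRUTH_DIR [MODEL_NAME]
# Set DRY_RUN=1 to print the make command instead of running it.
run_training() (
  gt_dir="${1:-}"
  model_name="${2:-example}"   # placeholder model name, not from the PR

  if [ -z "$gt_dir" ] || [ ! -d "$gt_dir" ]; then
    echo "usage: train.sh <ground-truth-dir> [model-name]" >&2
    exit 1
  fi

  # Build the command as positional parameters so paths stay correctly quoted.
  set -- make training "MODEL_NAME=$model_name" "START_MODEL=eng" "GROUND_TRUTH_DIR=$gt_dir"

  if [ "${DRY_RUN:-0}" = "1" ]; then
    echo "$*"
  else
    "$@"
  fi
)
```

A real train.sh would end with `run_training "$@"`. Running it with `DRY_RUN=1` first is a cheap way to check the arguments before starting a long training run.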

Author

Totally, I had the same thought. One will likely run once while the other may need many runs.

greentext "Installing Deps and Creating File Structure"

# Don't pollute the directory
mkdir -p ./tess
Owner

Since this script creates artifacts, we'll need to add them to a .gitignore file. Ideally, we would keep a single .gitignore so everything is consolidated in one place.

Author

@Penguin2600 Penguin2600 Mar 5, 2024

Good callout, I'll consider what the new entries might need to be.
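
One possible way to keep those entries in the single top-level .gitignore is a small helper that appends an entry only when it is missing, so a setup script can be re-run safely. This is only a sketch: the function name `add_ignore_entry` is hypothetical, and the actual entries depend on where the script ends up creating its `tess` directory.

```shell
# add_ignore_entry GITIGNORE_FILE ENTRY
# Appends ENTRY to the file only if that exact line is not already present.
add_ignore_entry() (
  file="$1"
  entry="$2"
  touch "$file"

  # Make sure the file ends with a newline so the new entry starts its own line.
  if [ -s "$file" ] && [ -n "$(tail -c 1 "$file")" ]; then
    echo >> "$file"
  fi

  # -x matches whole lines, -F treats the entry as a literal string.
  if ! grep -qxF -- "$entry" "$file"; then
    printf '%s\n' "$entry" >> "$file"
  fi
)
```

For example, `add_ignore_entry .gitignore "dataScripts/tessTrain/tess/"` run from the repository root, assuming the artifacts land in that directory.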

dataScripts/tessTrain/tessTrain.sh (outdated)
Comment on lines +37 to +39
sudo apt-get install libicu-dev
sudo apt-get install libpango1.0-dev
sudo apt-get install libcairo2-dev
Owner

These too


greentext "Pulling the required ENG traineddata from github"
wget https://github.com/tesseract-ocr/tessdata/raw/main/eng.traineddata
sudo mv eng.traineddata /usr/local/share/tessdata
Owner

Is there a way to avoid populating paths outside of noitool? It would be great if this were confined to this directory. It looks like you can use the TESSDATA variable to make it local: https://github.com/tesseract-ocr/tesstrain?tab=readme-ov-file#train

Author

Yes, I think that's a great idea, will incorporate.
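
A minimal sketch of how that could look, assuming the start model is kept in a directory inside the repository rather than moved to /usr/local/share/tessdata. The function name `fetch_start_model` and the directory layout are hypothetical; the download URL is the one the script already uses.

```shell
# fetch_start_model TESSDATA_DIR
# Downloads eng.traineddata into TESSDATA_DIR (created if missing).
# Skips the download when the file is already there.
fetch_start_model() (
  tessdata_dir="$1"
  target="$tessdata_dir/eng.traineddata"
  mkdir -p "$tessdata_dir"

  if [ ! -f "$target" ]; then
    # Download to a temporary name first so a failed download
    # does not leave a truncated eng.traineddata behind.
    wget -O "$target.part" \
      https://github.com/tesseract-ocr/tessdata/raw/main/eng.traineddata \
      && mv "$target.part" "$target"
  fi
)
```

The same directory can then be handed to tesstrain through its `TESSDATA` Makefile variable, for example `make training MODEL_NAME=example START_MODEL=eng TESSDATA="$PWD/tess/tessdata"`, where the model name and path are placeholders. No sudo is needed and nothing outside the repository is touched.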

@Penguin2600
Author

I also have a question about dataScripts/tessTrain/example_truth/97984949.png and similar ones. Would the extra M hinder the training in any way?

Short Answer:
I don't know.

Long answer:
My best understanding is that if the box generation step tags that partial M as a glyph, then training may try to label it. It will almost certainly fail at that, since the M is not present in the corresponding gt.txt file. The mismatch will increase the resulting training error, which may, in a somewhat artificial way, cause that training sample to be ignored. Ideally we should give it perfectly clean images and perfectly clean matching ground truth text. In this case, depending on the training execution variables, I see 90-98% success rates on validation data from the minimal example training set.
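
A content mismatch like the stray M can't be caught mechanically this way, but a missing transcription can. The sketch below assumes the tesstrain convention that each image NAME.png has its transcription in NAME.gt.txt in the same directory; the function name `list_unpaired` is hypothetical.

```shell
# list_unpaired GROUND_TRUTH_DIR
# Prints every .png in the directory that has no NAME.gt.txt next to it.
list_unpaired() (
  gt_dir="$1"
  for img in "$gt_dir"/*.png; do
    [ -e "$img" ] || continue      # the glob matched nothing
    base="${img%.png}"
    [ -f "$base.gt.txt" ] || echo "$img"
  done
)
```

Empty output from `list_unpaired ./example_truth` only means every image has a ground truth file; whether the text matches what is visible in the image still has to be checked by eye.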


2 participants